Introduction to Flex and Bison

Joe C. Solinsky <jcsky@cs.UCR.edu>


Introduction

There are a couple of languages out there that come with the GCC distribution on Aminet which most of us will overlook as some sort of enigmatic utility, unless they are pointed out. Flex and Bison are a team of languages which sit on top of C or C++ and do a lot of the dirty work of lexical analysis and parsing. If Flex is used by itself, your program can recognize single words or perform pattern matching based on what are known as regular expressions. If Bison is used on top of that, your program can handle grammar structures to deal with increasingly complex input, up to the level of being able to write your own programming language (compiled via C, of course), if for some reason you would want to do this.

I won't pretend to explain either of these languages to any useful level, because they are rather lengthy, but I hope that these basics will give you a feel for what is going on as you read the core of this article. Please read the extensive and useful AmigaGuide documentation which is in the GCC 2.6.3 distribution. I found it at times more useful in hypertext format than the traditional book by O'Reilly & Associates, although I am not about to knock their book (which bears the title Lex & Yacc, after the subtly different predecessors of Flex and Bison, respectively), and I admit to reading it cover to cover (excepting the appendices, which seem to be mainly for the 'old salts' of previous lexical analyzer tools).

Flex

The input to Flex can be any standard input stream in C, including direct user input via an AmigaShell window, a text file, or the output of another program (and possibly other ways of generating input that I have not thought of). In the input, ANSI characters are considered valid, so from the theory point of view, your "alphabet" is the full character set: ASCII letters, numbers, whitespace, and punctuation can all be dealt with. Since the source code for these programs is available, it is possible to modify Flex to accept other symbols, and perhaps with a bit of effort, binary files, but I personally wouldn't try that without extreme confidence in what was to be done.

Bison

Bison accepts the tokens that Flex generates. The two are kept in agreement through a table of tokens that Bison builds from the %token declarations in your grammar: Bison writes this table out as the include file bison.tab.h, and the Flex program includes it so that its rules can return exactly those token codes. Bison also recognizes single characters, so there is some overlap with Flex, but consider it the second stage of a two-part process, not the end-all lexical analyzer, and you will save yourself much grief upon this inevitable conclusion. It is a really, really good idea to double-check the case and spelling of the values your Flex rules return against the tokens your Bison grammar declares, because they must match the definitions in that shared include file exactly. Bison represents these tokens as integers with the use of C's #define preprocessor instruction, as you can observe in the include file, so if you choose to do something with the numbers yourself, beware the dreaded "magic numbers" pitfall of programming constants into your code.

Bison and Flex programs will not compile under C by themselves. As I said before, they are preprocessor languages (meaning the Flex and Bison code must be processed first), but they generate C source code (a typically large file, I might add, so be prepared for that) which can be compiled with your GCC compiler, or other compilers, assuming you are sticking to pure ANSI C and do not reference things specific to a given compiler or operating system (otherwise, you are stuck preprocessing on your Amiga only!). One benefit of this flexibility through standards is that should your Flex and Bison code become lengthy, the task of turning it into ANSI C could be left up to another, more powerful machine, and with the blessings of a good modem, that machine could be anywhere, like the monster server at work or school which always seems to have spare CPU cycles for an avid programmer like yourself. This matters because for larger programs, especially if you are doing fancy tricks, you will soon see that the preprocessing can take a considerable chunk of time on your Amiga.

The GCC AmigaGuide documentation on Flex and Bison makes reference to some rather simple calculators and a Pascal language recognizer. This is primarily because these are the classic examples of what Flex and Bison are good for. The real purpose of this article is to get past that and focus on the application of these two languages, not the instruction itself (how could I possibly top GNU documentation?). As I worked with these languages at my university, I discovered that state machines (the theoretical machines upon which they are based) are capable of much more, and should be utilized to their fullest. Only a modicum of work is necessary to do powerful things with these languages, and I feel it adds to the community of the Amiga if we, the programmers, show off how clever we are, to the benefit of our fellow end-users.

Anagrams

Just a week or so before I began this article, I saw a posting on comp.sys.amiga.misc which listed "Top Ten Anagrams" for the title of Amiga Technologies, the entity which brings us newer, faster Amigas from Germany. I pursued the poster of the article to find out where the anagrams came from, as I was suspicious that it was done with a computer. Sure enough, there was a WWW site which would take a string in and output as many anagrams as it could form from the string. It didn't take me long to figure out that the Flex program to do this was conceivable, through the theoretical construct of set notation:

Set 1 is all tokens which are words in the English language.

Set 2 is all tokens in Set 1 which use only the characters in the input string, and no character more often than it occurs there; otherwise you would end up with "too" as a valid word formed from the input "to", which is impossible, because "to" has only one 'o' in it to begin with.

The English language is large. That is why we made Set 2. I realize this can work for any language, but there would need to be modifications made to Flex to recognize things like umlauts and eszetts, much less all of Kanji. Any language spoken by a culture is going to be computationally large. That is why the program to make these anagrams must restrict itself to only valid words.

Once one has Set 2 defined, the following must occur to finally get the anagram:

  1. Start with the largest word (sized by the number of characters in it) in Set 2 constructible from the input string, then remove the letters found in that first word from the input string, as if you were crossing them off an imaginary list of letters that you have used up.

  2. Pass what's left through the code that recognized the first English word that was found, doing this recursively.

  3. If you use up all the letters in the original input string, print the anagram. Otherwise, you have leftover letters, and did not succeed in making an anagram.

  4. Permanently remove that first word from Set 2, once you have exhausted all attempts at making an anagram with that word used first. Why? Because otherwise you will end up with every possible order of words for every set of words which fits the anagram, and that is not something that needs to be done with the parser (producing the various orders of the words in an anagram does not involve parsing, it involves sorting); besides, the resulting list is going to be much larger. I suppose this is up to you, but passing just my full name resulted in 45 different anagrams, where each anagram was a unique set of words; had I opted to have all the pattern orders calculated, the number of responses would have been roughly 45 times the factorial of the average number of words in each anagram.

  5. After removing the first word from Set 2, pass in the second largest word in Set 2 constructible from the input string. If you repeat as you did above for the first word, using recursion, you will ultimately get all unique anagrams that can be made from the input string.

Scanning for words

This little project is a fairly short program to write in Flex, and it would be significantly more challenging to write in C. Unfortunately, the code for this program is nowhere near as intuitive as the breakdown of the problem. Just to give you something that is a bit easier to approach in terms of solving on your own, in case this puzzle has you stumped, think of a program in Flex (only) that scans the words of the English language and returns only those in which the letters A, E, I, O, and U occur in that natural order (interspersed with consonants, of course). "Facetious" is one such word. The solution for this problem is quite short.

The first step is to make a single predefined set of all consonants, and then reference it inside the rules. The set looks like this (CONSONT is my name for the set):

CONSONT [bcdfghjklmnpqrstvwxyz]*

And it can be referenced in the rules section to solve the puzzle in a single line:

a{CONSONT}e{CONSONT}i{CONSONT}o{CONSONT}u{CONSONT}

Realistically, the puzzle needs a few more lines for housekeeping purposes, but these two lines are the 'creative' part of the problem.

A complete Flex input file, with C++ main(), is included as puzzle.yy. It can be tested with the commands
flex puzzle.yy (which creates a source file named lex.yy.c)
cc lex.yy.c (or sc for SAS/C, of course; depending on your setup you may also need to link the Flex runtime library)

Flex is quite useful for problems that deal with formatted input. Bison, on the other hand, is better for complex structural input. The examples mentioned earlier, like the language parser and the calculator, do indeed show that Bison can handle tasks which would be difficult in Flex alone: they would require programming in Flex's own state-machine commands, which tends to require redundant rules for every input, in every state, that does not belong to a valid input pattern (or collection of tokens that forms a pattern recognized by a grammar). Take a look at the examples in the AmigaGuide documentation for Bison, and you will see what I mean.

Uses for Flex

Perhaps you are asking yourself "why do I want to program another calculator?" or "I don't even use Pascal anymore, why would I want to parse it?" as I did when I first mulled over these examples. Most likely, you don't. Flex and Bison together are tools that we as programmers can use to make our work easier (and with them, perhaps someday we can have a program which solves crossword puzzles by brute-force permutation).

The most common programming tool built with these two languages is what is known as a pretty printer. Basically, this is a program that makes appropriate use of white space in our source code, indenting and flushing it so that when reading it, the lexical levels (scope levels) appear as successive indentations, our comments line up neatly, and functions that wrap around to the next line do so neatly. But there are pretty printers out there already; there is one in GNU Emacs that runs on the fly as you code, part of the text editor itself (written in Lisp, I believe).

Lexical analyzers can also be made to preprocess your source code and check for bugs in implementations of libraries, giving you feedback where your compiler misses details (or gives you nearly meaningless error messages like "parse error," which only tells you your grammar seems to be off). They also appear in newer word processors (on the IBM platform, as far as I have seen) that check whether the word you are typing is in a dictionary of the English language, and fix the spelling for you on the fly. Unfortunately, this is not a good way to improve your spelling skills, and often it doesn't handle the actual use of the word ('their' as opposed to 'there'), leaving you with not a spelling error but scattered English grammar. Command shells can also be built with lexical analyzers, and as we have seen with programs like KingCon and others, a lot of control can be added over the standard AmigaShell.

The development tool I would like to suggest experimenting with is meant for the ambitious programmer (for the non-ambitious: work on the word puzzles until you are ready) who wants to make lexical analyzers work for him or her in the professional arena. Often, as we develop software, there is a tremendous amount of documentation coupled with what needs to be written in the actual program, and comments galore. As humans, we programmers face the daunting task of remembering what var1 is, and why it is equal to my_input1 (good examples of bad variable names), and in the process of software development, we forget, again and again. We make silly mistakes because it's 3:00am and the side effects of too much coffee are making us jittery and distracted.

Documentation and the classical techniques of software development are designed to help us, not waste our time when we could be debugging. If we were to somehow incorporate documentation into software development, and then reference the documentation as we code, a lot of pitfalls could effectively be removed from the process. The next question, naturally, is how.

The answer is totally dependent on your views, but here is my perception of things, and I hope it can spark even better ideas as you read. When I develop software, I recognize a few different stages. There is the analysis and specification phase, where the idea is nailed down and copied off the white board and onto paper. Then there is the design phase, where the implementation of the idea concerns itself with issues of input devices, methods of response, machine specifications, application appearance and operation, and other issues of the software as the user would see it. After that is established, the architecture, what is behind the design, is hammered out: functions are given names and purposes, data structures are contrived and outlined, and program control is carefully mapped out. After the program is structured, the application then goes through a process of narrowing focus on each function, algorithm, and manner of execution, until finally there is a program sitting on the hard drive which you (and your development team) have written, and you are positive it is worth something because you tested it, it works, and it is what you wanted.

Lexical analyzers can come in handy as the programs under which programs are written. With enough flexibility, a good software-designer application could address all of these phases of software development with the general principle of question and answer. Take the following interaction as a possible way that this tool could behave:

  1. Ask the designer for a few sentences on what the program to be written will do.
  2. Ask the designer to expand the abstract of the program when it finds certain key words.
user
I want to write a video game where you are flying this ship through a jungle and shoot at aliens. The aliens fly at you in different waves, and get harder to shoot. Also, as you go along, you can pick up weapons and make your ship faster, get extra lives, and bonus points. There will be a main boss enemy at the end which is hard to kill.
design tool
So you are writing a 'game,' eh? Good idea. When you described it with 'you are flying,' what did you mean by 'you?'
user
the player.
design tool
Then it should read like this, right? 'I want to write a video game where the player is flying'
user
yes.
design tool
How about 'aliens fly at you,' is that the same 'you' as 'the player'?
user
yes.

[and so on, until every use of the word 'you' has been clarified. When the meaning of the word 'you' branches, the computer starts asking which 'you' it is based on a larger range of possibilities.]

design tool
So, as I understand it, your 'game' can be described as:
I want to write a video game where the player is flying this ship through a jungle and shoots at aliens. The aliens fly at the player in different waves, and get harder to shoot. Also, as the player goes along, the player can pick up weapons and make the player's ship faster, get extra lives, and bonus points. There will be a main boss enemy at the end which is hard to kill.
design tool
Are you happy with this definition, or do you want to make corrections?
user
It's fine. [or any variety of acceptable responses]
design tool
Alright, let's move on.
design tool
Tell me more about the 'ship' you mentioned here: 'the player is flying this ship'.
user
It's a spaceship. It is an Alpha Class fighter.
design tool
Is this the same ship as referenced in 'make the player's ship faster'?
user
yes.

[and so on. The idea is to expand definitions as much as possible to avoid ambiguity, and create appropriate handles for the design tool, as well as save the user from writing documentation which isn't intuitively defined. In the end, the short description could look something like this.]

I want to write a video game where the player is flying this Alpha Class fighter-spaceship through a jungle on the planet Zarkan and shoots weapons such as missiles, lasers, and nuclear warheads at three-headed space aliens, giant snake space aliens, blood-sucking vampire aliens, and octopus aliens. The same aliens fly at the player by weaving back and forth, dancing across the edges of the screen, darting in from the corners, and materializing randomly on the screen, in different waves, where the same aliens will fly in squadrons of the different kinds of aliens and in different numbers, and it gets harder for the Alpha Class fighter-spaceship to shoot its weapons at the same aliens. Also, as the player goes along the jungle on Zarkan, the player can pick up more powerful missiles, broader lasers, and more damaging nuclear warheads, and make the Alpha Class fighter-spaceship maneuver on the screen faster, get extra lives which allow the player to start where the player was killed and continue, and bonus points which increase the player's overall score. There will be a main boss enemy alien that attacks the Alpha Class fighter-spaceship at the end of the jungle on Zarkan, and the boss enemy alien is hard for the player to kill.

At this point, you are probably saying to yourself "okay kid, let's see this wonder program of yours so far, written in something which has previously only been demonstrated as a calculator builder." Right you are, this is daunting and ambitious. Here's how I would do it:

By establishing certain rules of the English language, we can familiarize ourselves with possible ambiguities. The most obvious one is the use of pronouns in a sentence. Take this sentence as an example:

When my brother took me to see our dad, he said that mom was going to be late for dinner.

Who spoke? It is totally vague. Logically it should be the brother, because the brother was the subject of the time clause and therefore carries more weight than the infinitive phrase 'to see our dad,' but it could very well be the dad who spoke. Watch for this in conversations; see how many times you have to ask whether it was the dad or the brother (metaphorically speaking) who spoke.

By first addressing this ambiguity, the specifications document can become significantly clearer, almost as if you had someone there to proofread for you. In the English language, there is, happily, a finite set of words that can be classified as pronouns. It is relatively easy to parse them out and poll the user to specify what is meant at each instance, then replace each pronoun with the object of the sentence which was typed in as a response to it. (Parsing the English language is manageable with Bison, because a good writing-style handbook will have notes about sentence structure, and this can be translated into something Bison can recognize.) Oh, sure, this application can be fooled with enough trickery, but that's why programs come with instructions telling you not to use the subjunctive when addressing the program (or at least some mechanism in the parser which produces an error message saying the response's grammar is too complex for the parser to understand, and polls the user again).

With this mechanism of polling responses and doing textual replacements, all the pronouns can be clarified. In addition, the implied subjects of gerunds can be specified where the structure of the sentence makes it vague, and it can also correct poor grammar (as a note, Final Copy II does grammar checking, so don't think this is unheard of).

Once things are laid out and everything is undeniably under only a single interpretation when read, you will have a pretty solid document. Why go to so much trouble? Well, the original reason is so that people who read it don't have to ask questions like "who said mom was going to be late for dinner? I'm not sure which person is saying that." The additional reason revolves around the fact that all documentation in this application builder is functional.

The document can slowly grow into something complete, to the point where no further questions can be asked regarding details of the program. For example, at one point, I mentioned nuclear warheads in my sample run. These are considered weapons that belong to the fighter-spaceship controlled by the player. Weapons can be fired at the enemy. They can also hit the enemy and be upgraded as the game goes along.

From a structural standpoint, the application can ask how the nuclear warhead is fired by the player, how the game will know when it hits the enemy, what happens when it hits the enemy (explosion, more points, what happens to the enemy when it is hit, and so on), and how it is upgraded (perhaps at a certain number of game points, perhaps something is dropped by the enemy when it is destroyed; you decide).

From a visual standpoint, questions can be asked like: when the nuclear warheads are fired by the player from the fighter-spaceship, what do they look like? What do they sound like? Obviously, there is no room for me to expand this entirely, even from this tiny aspect of the programming, but I think the idea is clear enough.

Given the visual descriptions (which might again be polled if it uses some kind of gerundive like flashing or burning, which hints at an animation or special effect), the application can define enough of the details that this portion of the design can be distributed to the artist, and the artist will know what to draw and how many drawings will need to be made for just this portion.

Given the structural descriptions, the application can build its own state machine for the game, determine all the modules which need to be built (ones that launch warheads, ones which handle aliens being hit by warheads, etc.), and even begin on assembling actual program control based on enough of this high-level context. In fact, if this program is used more than once, it is possible that a library can be established over time, containing code modules to handle frequent events, and given the hard drive space, this sort of application could truly blossom into a powerhouse of application building.

The example of the video game seems a bit complex. Games are complex. But I wanted to show the flexibility that has to be considered for a project like this. Consider instead a simple data entry program with a graphical interface. The documentation for this is fairly short, assuming it is primitive. There are some entry functions, access functions, program control, file I/O, and maybe printing facilities. There are gadgets and menu options on a single window. There is likely little artwork or sound effects, no animations, and no musical score (unless you are weird). Still, it is asking a lot of our minds to constantly hold a clear picture of the entire application, one that doesn't change or get confused or forget parts. If we do a layout of all the buttons and text fields, what happens if somewhere down the line we forget to add a button? The further down the line, the more code is lost, and the greater the chance that there will be a bug. Right?

The actual specifications for this application builder are somewhat beyond the range of a document meant to inspire programmers into using Flex and Bison in new and innovative ways, and I apologize for not being able to provide them, but as with the anagram problem, the implementation is a tremendous job precisely because the details are not covered.

I hope you can appreciate, as I have, the value in these small programs, and use them to their fullest potential. However, I have one frightening word of caution about all of this. GNU software comes from the Free Software Foundation. You are well advised to read the legal preambles to the use of Flex and Bison. You may discover that your application is subject to being only freeware, much to the chagrin of your monumental intentions. I believe there is a point at which significant modifications to the source code generated by Flex and Bison will result in it no longer being under these rules; however, if you are thinking of doing a search-and-replace on variable names, moving functions around, and a few other tricks, just remember what these programs are designed to do, and consider that GNU probably has some way of testing for typical changes to their code. There is a light at the end of this tunnel, though. The source code for these applications is freely distributable, so modifying that source code and compiling your own de-FSF-ized versions of Flex and Bison could result in a nullification of GNU's hold over what you write with their programs. If you are serious, talk to a lawyer, and don't take the word of a college student (me) on what the law states.

A simple resolution to this (assuming my last theory holds in a court of law) might be to make an implementation of Flex and Bison that makes specific calls to the Amiga libraries instead of the ANSI C text I/O functions. Whatever the case, just remember to check with your lawyer before putting a price tag on it.

If you have read this entire document and are still at a loss as to what Flex and Bison can do for you, here are some ideas worth exploring:

Write a program that scans in AmigaBasic programs and converts them to C source code, so all your Basic code can have a second life.

Write a hard disk organizer which reads in all of your .info files and does such things as intuitively assign your favorite text reader as the default tool for all the text files, or your image viewer for image files. How many times have you gotten an archive off the net with lots of little text files, only to face the fact that you have to edit each .info file yourself? Wouldn't it be nice if something were to run and look for references to known text editors (and their paths) and reassign them to what you actually have?

Write a C parser which inputs your old source code and makes an attempt at optimizing readability by modularizing your 1000-line functions into smaller pieces, then placing them in separate source files.

Write a text encryption program however you wish.

Write a program that builds form letters based on a core letter, variable points, and a separate file formatted with those variables (people's names, addresses, etc.).

If you have to distribute source code with your application and you don't want people to understand it, write your own program which muddles the readability of your code by stripping comments, using incremental nondescript variable names, and replacing text segments with long, tedious strings of ASCII hex codes.

Write a program which makes a feeble attempt at commenting Assembly code on the Amiga by identifying what certain base addresses represent in memory operations or library calls, citing what the values of certain registers are (based on where they were originally input and what they change to), and generating a separate program flow chart, perhaps in the form of a real picture.

Write a floppy-disk organizer for floppy-based systems which stratifies individual disks based on file content (a disk full of all your letters to mom, another with your collection of sound effects), and optimizes space usage to fill each disk to the brim, or to whatever capacity you wish to have remaining, then swaps around all your files for you.

Please don't let your own list be this list alone. It shouldn't stop here. I encourage you to talk with your fellow Amiga programming peers and divas about solutions to some of these problems. In these uncertain times, the best thing we can do for the machine we love is make programs which don't exist anywhere else, and pioneer high-tech, clever software which the end user has just got to have, because it is too cool to do without. If you have questions about Flex or Bison, are having trouble installing your free GCC compiler, or need some kind of informed response to a question that covers the issues I mentioned, feel free to ask. I will do my best to answer as my own work schedule allows.

But before I get barraged with questions about shift-reduce and reduce-reduce errors, let me say this much: when you are using these languages, practice the technique of stepwise refinement; get a small part to work concretely, then add to that portion. It is tempting to lay everything out at once, pray, and compile, but when you have 100 lines of Bison rules and over 400 reduce-reduce conflicts, it really does take less time to start over, using your rules as a template for stepwise refinement. You will learn the (poorly explained anywhere you look) fickleness of token look-aheads and of impossible or vague grammars if you deal with your errors one at a time. And remember: even though there may be hundreds of conflicts, there may be only one error in your grammar.


About the author

Joe C. Solinsky was a student at the University of California at Riverside for four years. He is currently doing research at the University in robot simulators as a graphics programmer, but is looking to move out of the Ivory Tower of college, and into the real world again. As an Amiga user since 1987, he is die-hard about his favorite machine. He can be reached in the following ways:

jcsky@cs.ucr.edu
joe@mindesign.com
2442 Iowa Avenue Apartment K-8
Riverside, CA 92507
USA
(909) 788-5408